This notebook explores the more recent data from NYC Open Data Data Set.
This dataset can also be reached and interacted with through its Google BigQuery location
Try a different version: handout page, Tab Navigation
Reduce Car Accidents in Brooklyn
For this exercise, we’d like you to analyze data on New York motor vehicle collisions and answer the following question:
What are your ideas for reducing accidents in Brooklyn?
Imagine you are preparing this presentation for the city council who will use it to inform new legislation and/or projects.
Briefly:
Libraries that will be used during exploration
library(magrittr)
library(dplyr)
library(ggplot2)
library(viridis)
library(plotly)
library(maps)
library(rgeos)
library(rgdal)
library(ggthemes)
library(crosstalk)
library(leaflet)
library(d3scatter)
library(d3heatmap)
library(rnoaa)
#library(ggmap)
Collect API tokens in Environment Variables (purposefully kept hidden here). Tokens and keys used include Google Maps API key (get one here, Mapbox Access Token(get one here) and an NCDC token (here) for NOAA weather data.
Load data from /data directory and into memory
dt <- read.csv(file = "data/NYPD_Motor_Vehicle_Collisions.csv")
Inspect structure of dataset with the str() command:
str(dt)
'data.frame': 990800 obs. of 29 variables:
$ DATE : Factor w/ 1709 levels "01/01/2013","01/01/2014",..: 200 200 200 200 1068 920 611 200 65 65 ...
$ TIME : Factor w/ 1440 levels "0:00","0:01",..: 556 631 632 644 661 936 46 686 1051 1066 ...
$ BOROUGH : Factor w/ 6 levels "","BRONX","BROOKLYN",..: 1 2 2 3 1 1 1 1 3 3 ...
$ ZIP.CODE : int NA 10454 10466 11218 NA NA NA NA 11218 11236 ...
$ LATITUDE : num 40.7 40.8 40.9 40.6 40.7 ...
$ LONGITUDE : num -73.9 -73.9 -73.9 -74 -73.9 ...
$ LOCATION : Factor w/ 90272 levels "","(0.0, 0.0)",..: 28545 73165 89328 17808 52600 1 1 14824 17763 18379 ...
$ ON.STREET.NAME : Factor w/ 9151 levels "","?EST 125 STREET",..: 1420 1 3493 1 1 1 6598 4069 738 7071 ...
$ CROSS.STREET.NAME : Factor w/ 9585 levels "","0","01247",..: 1 1 9364 1 1 1 7399 3638 114 4201 ...
$ OFF.STREET.NAME : Factor w/ 59908 levels "","(26 BROOKLYN TERMINAL MARKET LOT)",..: 1 38225 1 29898 1 1 1 1 1 1 ...
$ NUMBER.OF.PERSONS.INJURED : int 0 0 1 0 0 0 0 0 1 2 ...
$ NUMBER.OF.PERSONS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.INJURED: int 0 0 1 0 0 0 0 0 0 0 ...
$ NUMBER.OF.PEDESTRIANS.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.INJURED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.CYCLIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ NUMBER.OF.MOTORIST.INJURED : int 0 0 0 0 0 0 0 0 1 2 ...
$ NUMBER.OF.MOTORIST.KILLED : int 0 0 0 0 0 0 0 0 0 0 ...
$ CONTRIBUTING.FACTOR.VEHICLE.1: Factor w/ 49 levels "","Accelerator Defective",..: 10 47 47 47 47 11 1 43 43 43 ...
$ CONTRIBUTING.FACTOR.VEHICLE.2: Factor w/ 49 levels "","Accelerator Defective",..: 47 1 1 47 47 47 1 47 47 47 ...
$ CONTRIBUTING.FACTOR.VEHICLE.3: Factor w/ 43 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 42 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.4: Factor w/ 42 levels "","Accelerator Defective",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CONTRIBUTING.FACTOR.VEHICLE.5: Factor w/ 31 levels "","Aggressive Driving/Road Rage",..: 1 1 1 1 1 1 1 1 1 1 ...
$ UNIQUE.KEY : int 3612721 3612791 3618743 3614471 3284922 2833714 336679 3618925 3598095 3597360 ...
$ VEHICLE.TYPE.CODE.1 : Factor w/ 18 levels "","AMBULANCE",..: 15 10 12 15 10 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.2 : Factor w/ 18 levels "","AMBULANCE",..: 10 1 1 10 16 10 1 10 10 15 ...
$ VEHICLE.TYPE.CODE.3 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 15 1 ...
$ VEHICLE.TYPE.CODE.4 : Factor w/ 18 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ VEHICLE.TYPE.CODE.5 : Factor w/ 16 levels "","AMBULANCE",..: 1 1 1 1 1 1 1 1 1 1 ...
Inspect summary of dataset with summary() command:
summary(dt)
DATE TIME BOROUGH ZIP.CODE
01/21/2014: 1161 16:00 : 12792 :260725 Min. :10000
01/18/2015: 960 15:00 : 12748 BRONX : 95396 1st Qu.:10075
02/03/2014: 960 17:00 : 12597 BROOKLYN :223552 Median :11205
03/06/2015: 936 18:00 : 11641 MANHATTAN :187571 Mean :10808
01/07/2017: 887 14:00 : 11094 QUEENS :189619 3rd Qu.:11236
09/30/2016: 872 13:00 : 10365 STATEN ISLAND: 33937 Max. :11697
(Other) :985024 (Other):919563 NA's :260826
LATITUDE LONGITUDE LOCATION
Min. : 0.00 Min. :-201.36 :201443
1st Qu.:40.67 1st Qu.: -73.98 (40.6960346, -73.9845292): 673
Median :40.72 Median : -73.93 (40.7606005, -73.9643142): 544
Mean :40.72 Mean : -73.92 (40.7572323, -73.9897922): 485
3rd Qu.:40.77 3rd Qu.: -73.87 (40.6757357, -73.8968533): 480
Max. :40.91 Max. : 0.00 (40.6585778, -73.8906229): 464
NA's :201443 NA's :201443 (Other) :786711
ON.STREET.NAME CROSS.STREET.NAME
:188246 :217648
BROADWAY : 10832 3 AVENUE: 11407
ATLANTIC AVENUE : 9354 BROADWAY: 11088
NORTHERN BOULEVARD: 7490 2 AVENUE: 9678
3 AVENUE : 6864 5 AVENUE: 7846
FLATBUSH AVENUE : 6500 7 AVENUE: 7312
(Other) :761514 (Other) :725821
OFF.STREET.NAME NUMBER.OF.PERSONS.INJURED
:916764 Min. : 0.0000
PARKING LOT 110-00 ROCKAWAY BOULEVARD : 150 1st Qu.: 0.0000
PARKING LOT-772 EDGEWATER RD : 91 Median : 0.0000
PARKING LOT OF 110-00 ROCKAWAY BOULEVARD: 90 Mean : 0.2552
3 AVENUE : 72 3rd Qu.: 0.0000
2 AVENUE : 67 Max. :43.0000
(Other) : 73566
NUMBER.OF.PERSONS.KILLED NUMBER.OF.PEDESTRIANS.INJURED NUMBER.OF.PEDESTRIANS.KILLED
Min. :0.000000 Min. : 0.00000 Min. :0.0000000
1st Qu.:0.000000 1st Qu.: 0.00000 1st Qu.:0.0000000
Median :0.000000 Median : 0.00000 Median :0.0000000
Mean :0.001214 Mean : 0.05455 Mean :0.0006833
3rd Qu.:0.000000 3rd Qu.: 0.00000 3rd Qu.:0.0000000
Max. :5.000000 Max. :15.00000 Max. :2.0000000
NUMBER.OF.CYCLIST.INJURED NUMBER.OF.CYCLIST.KILLED NUMBER.OF.MOTORIST.INJURED
Min. :0.00000 Min. :0.00e+00 Min. : 0.0000
1st Qu.:0.00000 1st Qu.:0.00e+00 1st Qu.: 0.0000
Median :0.00000 Median :0.00e+00 Median : 0.0000
Mean :0.02093 Mean :7.47e-05 Mean : 0.1927
3rd Qu.:0.00000 3rd Qu.:0.00e+00 3rd Qu.: 0.0000
Max. :6.00000 Max. :1.00e+00 Max. :43.0000
NUMBER.OF.MOTORIST.KILLED CONTRIBUTING.FACTOR.VEHICLE.1
Min. :0.000000 Unspecified :523736
1st Qu.:0.000000 Driver Inattention/Distraction:127688
Median :0.000000 Fatigued/Drowsy : 48249
Mean :0.000463 Failure to Yield Right-of-Way : 42948
3rd Qu.:0.000000 Other Vehicular : 30393
Max. :5.000000 Backing Unsafely : 27886
(Other) :189900
CONTRIBUTING.FACTOR.VEHICLE.2
Unspecified :738985
:123724
Driver Inattention/Distraction: 37843
Other Vehicular : 17711
Fatigued/Drowsy : 13016
Failure to Yield Right-of-Way : 9087
(Other) : 50434
CONTRIBUTING.FACTOR.VEHICLE.3
:925696
Unspecified : 59537
Other Vehicular : 1225
Fatigued/Drowsy : 1122
Driver Inattention/Distraction: 1100
Pavement Slippery : 234
(Other) : 1886
CONTRIBUTING.FACTOR.VEHICLE.4
:976717
Unspecified : 12938
Fatigued/Drowsy : 222
Other Vehicular : 221
Driver Inattention/Distraction: 192
Pavement Slippery : 67
(Other) : 443
CONTRIBUTING.FACTOR.VEHICLE.5 UNIQUE.KEY
:987360 Min. : 22
Unspecified : 3186 1st Qu.: 249509
Other Vehicular : 52 Median :3131520
Fatigued/Drowsy : 48 Mean :2054070
Driver Inattention/Distraction: 36 3rd Qu.:3379220
Pavement Slippery : 23 Max. :3627969
(Other) : 95
VEHICLE.TYPE.CODE.1 VEHICLE.TYPE.CODE.2
PASSENGER VEHICLE :579372 PASSENGER VEHICLE :438701
SPORT UTILITY / STATION WAGON:218537 SPORT UTILITY / STATION WAGON:165455
TAXI : 37190 :134997
VAN : 26511 UNKNOWN : 80864
OTHER : 24699 TAXI : 31205
UNKNOWN : 20713 OTHER : 25249
(Other) : 83778 (Other) :114329
VEHICLE.TYPE.CODE.3 VEHICLE.TYPE.CODE.4
:926878 :977085
PASSENGER VEHICLE : 38181 PASSENGER VEHICLE : 8441
SPORT UTILITY / STATION WAGON: 15761 SPORT UTILITY / STATION WAGON: 3553
UNKNOWN : 3240 UNKNOWN : 583
VAN : 1401 VAN : 248
TAXI : 1163 OTHER : 205
(Other) : 4176 (Other) : 685
VEHICLE.TYPE.CODE.5
:987436
PASSENGER VEHICLE : 2072
SPORT UTILITY / STATION WAGON: 958
UNKNOWN : 94
OTHER : 52
VAN : 50
(Other) : 138
Our Dataset structure revealed the variables and their classes sapply(names(dt), function(x) paste0(x, ' is class: ', class(dt[[x]])))
The first thing to come to mind with such a factor heavy dataset is counting. Factors are not a hodgepodge collection of values, observed as they are pick up off the ground. Each Level of a factor - ideally - should have been intentionally designed. Ordered and distributed according to a purpose greater than the unit. Though not statically related as quantifiable assets, the levels in a factor are each related to one and other in addition to the group as a whole.
There are a lot of empty cells. To make sure we use a universal value for blank or Not Available we will assign the value NA to all blank cells. While munging around with the data, add what could be a valuable variable created from two current variables. The DATE and TIME variables are set up as factors. This has interesting categorical value so they will stay in the data. Rather than replace the two rows, add a third one in a POSIX date form.
dt[dt == ''] <- NA
# create date.time column for time use
#dt$date.time <- as.Date(dt$DATE, format("%m/%d/%Y"))
# use hours & min in date.time var for calculations
dt <- within(dt, {date.time = as.POSIXct(strptime(paste(DATE, TIME), "%m/%d/%Y %H:%M"))})
With Latitude and Longitude present and appearing to be fairly well documented, let’s take a quick look at how these accidents look over an interactive world map (incase of mistakes outlying somewhere aside from New York). We will use the BOROUGH variable as a factor. This gives the geographic association of each borough and allows us early forsight into anything specific about our point of interest BOURGH == "BROOKLYN"
mp <- dt %>%
plot_mapbox(lat = ~LATITUDE, lon = ~LONGITUDE,
split = ~BOROUGH, mode = 'scattermapbox') %>%
layout(mapbox = list(zoom = 9,
center = list(lat = ~(40.7), lon = ~(-74.0))))
plotly_build(mp)